Cross-situational and supervised learning in the emergence of communication 
Scenarios for the emergence or bootstrap of a lexicon involve the repeated interaction between at
least two agents who must reach a consensus on how to name N objects using H words. Here we
consider minimal models of two types of learning algorithms: cross-situational learning, in which
the individuals determine the meaning of a word by looking for something in common across all
observed uses of that word, and supervised operant conditioning learning, in which there is strong
feedback between individuals about the intended meaning of the words. Despite the stark differences
between these learning schemes, we show that they yield the same communication accuracy in the
realistic limits of large N and H, which coincides with the result of the classical occupancy problem
of randomly assigning N objects to H words.



I. INTRODUCTION 

How a coherent lexicon can emerge in a group of inter- 
acting agents is a major open issue in the language evolu- 
tion and acquisition research area (Hurford, 1989; Nowak 
& Krakauer, 1999; Steels, 2002; Kirby, 2002; Smith, 
Kirby, & Brighton, 2003). In addition, the dynamics in 
the self-organization of shared lexicons is one of the issues 
to which computational and mathematical modeling can 
contribute the most, as the emergence of a lexicon from 
scratch implies some type of self-organization and,
possibly, threshold phenomena. This cannot be completely
understood without a thorough exploration of the pa- 
rameter space of the models (Baronchelli, Felici, Loreto, 
Caglioli, & Steels, 2006). 

There are two main research avenues to investigate 
the emergence or bootstrapping of a lexicon. The first 
approach, inspired by the seminal work of Pinker and 
Bloom (1990) who argued that natural selection is the 
main design principle to explain the emergence and
complex structure of language, resorts to evolutionary
algorithms to evolve the shared lexicon. The key element here
is that an improvement in the communication ability of
an individual results, on average, in an increase in the
number of offspring it produces (Hurford, 1989; Nowak &
Krakauer, 1999; Cangelosi, 2001; Fontanari & Perlovsky, 
2007, 2008). The second research avenue, which we will 
follow in this paper, argues for a culturally based view 
of language evolution and so it assumes that the lexicons 
are acquired and modified solely through learning during 
the individual's lifetime (Steels, 2002; Smith, Kirby, & 
Brighton, 2003). 

Of course, if there is a fact about language which is 
uncontroversial, it is that the lexicon must be learned 
from the active or passive interaction between children 
and language-proficient adults. The issue of whether this 
ability to learn the lexicon is due to some domain-general 
learning mechanism, or is an innate ability, unique to
humans, is still on the table (Bates & Elman, 1996). In
the problem we address here, there are simply no
language-proficient individuals, so it is not so far-fetched to put
forward a biological rather than a cultural explanation for 
the emergence of a self-organized lexicon. Nevertheless, 
in this contribution we will use many insights produced 
by research on language acquisition by children (see, e.g., 
Gleitman, 1990; Bloom, 2000) to study different learning 
strategies. 

From a developmental perspective, there are basi- 
cally two competing schemes for lexicon acquisition by 
children (Rosenthal & Zimmerman, 1978). The first 
scheme, termed cross-situational or observational learn- 
ing, is based on the intuitive idea that one way that 
a learner can determine the meaning of a word is to 
find something in common across all observed uses of 
that word (Pinker, 1984; Gleitman, 1990; Siskind, 1996). 
Hence learning takes place through the statistical sam- 
pling of the contexts in which a word appears. Since 
the learner receives no feedback about its inferences, we 
refer to this scheme as unsupervised learning. The sec- 
ond scheme, known generally as operant conditioning, in- 
volves the active participation of the agents in the learn- 
ing process, with exchange of non-linguistic cues to pro- 
vide feedback on the hearer inferences. This supervised 
learning scheme has been applied to the design of a sys- 
tem for communication by autonomous robots - the so- 
called language game in the Talking Heads experiments 
(Steels, 2003). Despite the technological appeal, the
empirical evidence is that most of the lexicon is
acquired by children as a product of unsupervised learning
(Pinker, 1984; Gleitman, 1990; Bloom, 2000). 

Interestingly, from the perspective of evolving or boot- 
strapping a lexicon, the unsupervised scheme is very at- 
tractive too, since it eliminates altogether the issue of 
honest signaling (Dawkins & Krebs, 1978), as no signal- 
ing is involved in the learning process, which requires 
only observation and some elements of intuitive psychol- 
ogy (e.g. Theory of Mind). 

Many different computational implementations and 
variants of these two schemes for bootstrapping a lexicon 
have been proposed in the literature. For example, Smith
(2003a, 2003b), Smith, Smith, Blythe, & Vogt (2006),
and De Beule, De Vylder, & Belpaeme (2006) have
addressed the unsupervised learning scheme, whereas Steels
& Kaplan (1999), Ke, Minett, Au, & Wang (2002), Smith,
Kirby, & Brighton (2003), and Lenaerts, Jansen, Tuyls,
& De Vylder (2005) have addressed the supervised scheme. However,
except for the extensive statistical analysis of a variant
of the supervised learning algorithm which reduces the
problem to that of naming a single object (Baronchelli,
Felici, Loreto, Caglioli, & Steels, 2006), the study of the
effects of changing the parameters of those models has
usually been limited to the display of the time
evolution of some measure of the communication accuracy of
the population. Although at first sight the supervised
learning scheme may seem to be clearly superior to the 
unsupervised one (albeit less realistic in the context of 
language acquisition by children), we are not aware of 
any thorough comparison between the performances of 
these two learning scenarios. In fact, in this contribution 
we show that in a realistic limit of very large lexicon sizes 
the supervised and unsupervised learning performances 
are essentially identical. 

In this paper we study minimal models of the super- 
vised and unsupervised learning schemes which preserve 
the main ingredients of these two classical language ac- 
quisition paradigms. For the sake of simplicity, here 
we interpret the lexicon as a mapping between objects 
and words (or sounds) rather than as a mapping be- 
tween meanings (conceptual structures) and sounds. A 
more complete scenario would involve first the creation 
of meanings, i.e., the bootstrapping of an object-meaning 
mapping (Steels, 1996; Fontanari, 2006) and then the 
emergence of a meaning-sound mapping (see, e.g., Smith, 
2003a, 2003b; Fontanari & Perlovsky, 2006). 



II. MODEL 

Following a common assumption in lexicon bootstrap- 
ping models, such as the popular iterated learning model 
(Smith, Kirby, & Brighton, 2003; Brighton, Smith, & 
Kirby, 2005), we consider here only two agents who play
in turns the roles of speaker and hearer. The agents live
in a fixed environment composed of N objects and have
H words available to name these objects. As we are
interested in the limit where N and H are very large with
the ratio a = H/N finite, we do not need to account
for the possibility of creation of new words as in some
variants of the supervised learning scheme (Baronchelli,
Felici, Loreto, Caglioli, & Steels, 2006).

We assume that each agent is characterized by an N × H
verbalization matrix P whose entries p_nh ∈ [0, 1],
with Σ_h p_nh = 1 for all values of n = 1, ..., N, are
interpreted as the probability that object n is associated
with word h. This assumption rules out the existence
of objects without names, but it allows for words which
are never used to name objects. To describe the
communicative behavior of the agents through the verbalization
matrix (i.e., the associations between objects and words 
for use both in production and interpretation) we need 
to specify how the speaker chooses a word for any given 
object as well as how the hearer infers the object the 
speaker intended to name by that word. 

To name an object, say object n, the speaker simply
chooses the word h* which is associated with the
largest entry of row n of the matrix P, i.e., h* =
argmax_h {p_nh, h = 1, ..., H}. In addition, to guess which
object the speaker named by word h the hearer selects
the object that corresponds to the largest of the N entries
p_nh, n = 1, ..., N. In other words, the hearer chooses
the object that it itself would be most likely to associate 
with word h (Smith, 2003a, 2003b). This amounts to 
assuming that the agents are endowed with a 'Theory of 
Mind' (ToM), i.e., that the hearer is somehow able to 
understand that the speaker thinks similarly to itself and
hence would behave likewise when facing the same
situation (Donald, 1991). We note that the original inference
scheme, termed "obverter" (Oliphant & Batali, 1997), 
assumed that the hearer has access to the verbalization 
matrix of the speaker (through mind reading, as the crit- 
ics were ready to point out). Here we follow the more rea- 
sonable scheme, dubbed "introspective obverter" (Smith, 
2003a), which requires endowing the agents with a The- 
ory of Mind rather than with telepathic abilities. 
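
As an illustration, the production and interpretation rules just described can be sketched in a few lines of Python. This is only a minimal sketch: the function names `speak` and `interpret`, the toy matrix, and the tie-breaking rule (the lowest index wins) are assumptions made here for concreteness and are not part of the model specification.

```python
def speak(P, n):
    """Word index holding the largest entry of row n of P (ties: lowest index)."""
    row = P[n]
    return max(range(len(row)), key=lambda h: row[h])


def interpret(P, h):
    """Object index holding the largest entry of column h of the hearer's own P."""
    return max(range(len(P)), key=lambda n: P[n][h])


# Hypothetical verbalization matrix with 3 objects and 2 words (rows sum to 1).
P = [[0.9, 0.1],
     [0.2, 0.8],
     [0.6, 0.4]]

word = speak(P, 2)           # object 2 is named by word 0 (0.6 > 0.4)
guess = interpret(P, word)   # column 0 is largest at object 0, so the guess is object 0
```

In this toy example the same matrix is used for production and interpretation, and object 2 is misunderstood as object 0 because both objects favor the same word.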

Effective communication takes place when the two
agents reach a consensus on which word must be assigned
to each object. To achieve this, we must provide a
prescription to modify their initially random verbalization
matrices. Here we will consider two learning procedures
that differ basically in whether the agents receive
feedback (supervised learning) or not (unsupervised
learning) about the success of a communication episode. But
before doing this we need to set up the language game 
scenario where the agents interact. 

From the list of N objects, the agent who plays the 
speaker role chooses randomly C objects without replace- 
ment. This set of C objects forms the context. Then the 
speaker chooses randomly one object in the context and 
produces the word associated to that object, according 
to the procedure sketched before. The hearer has access 
to that word as well as to the C objects that comprise 
the context. Its task is to guess which object in the con- 
text is named by that word. This is then an ambiguous 
language acquisition scenario in which there are multiple 
object candidates for any word. Once the verbalization 
matrices are updated the two agents interchange the roles 
of speaker and hearer and a new context is generated fol- 
lowing the same procedure. 
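
The generation of a context and of the object to be named can be sketched as follows. The helper name `draw_episode`, the 0-based object labels, and the seed are assumptions introduced here for illustration only.

```python
import random


def draw_episode(N, C, rng):
    """Return (context, target): C distinct objects out of N, and the one to be named."""
    context = rng.sample(range(N), C)   # sampling without replacement
    target = rng.choice(context)        # object the speaker will name
    return context, target


rng = random.Random(0)                  # arbitrary seed, for reproducibility only
context, target = draw_episode(N=16, C=2, rng=rng)
```

The hearer would then receive `context` together with the word produced for `target`, but not `target` itself.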

To control the convergence properties of the learning
algorithms described next we assume that the entries
p_nh are discrete variables that can take on the values
0, 1/M, 2/M, ..., 1 − 1/M, 1. In our simulations we choose
M = 10^. The reciprocal of M can be interpreted as the
algorithm learning rate. In addition, as there are two
agents who alternate in the roles of speaker and hearer,
henceforth we will add the superscripts I or J to the
verbalization matrix in order to identify the agent it
corresponds to. At the beginning of the language game each
agent has a different, randomly generated verbalization
matrix. More precisely, to generate the row n of P^I we
distribute with equal probability M balls among H slots
and set the value of entry p^I_nh as the ratio between the
number of balls in slot h and the total number of balls M.
An analogous procedure is used to set the initial value of
P^J.
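
A minimal sketch of this balls-in-slots initialization is given below. Storing each entry as an integer count k (so that the probability is k/M) is a choice made here to avoid floating-point drift; the function names and the small values of N, H and M in the example are likewise hypothetical.

```python
import random


def random_row(H, M, rng):
    """Drop M balls uniformly at random into H slots; return the counts per slot."""
    counts = [0] * H
    for _ in range(M):
        counts[rng.randrange(H)] += 1
    return counts                        # entry h represents the probability counts[h] / M


def random_matrix(N, H, M, rng):
    """Independent random rows for each of the N objects."""
    return [random_row(H, M, rng) for _ in range(N)]


rng = random.Random(0)                   # arbitrary seed
P_I = random_matrix(N=5, H=3, M=100, rng=rng)
P_J = random_matrix(N=5, H=3, M=100, rng=rng)   # the second agent gets its own matrix
```

By construction every row sums to M, i.e., to probability 1.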



A. Unsupervised learning 

In this scheme, the list of objects in the context
n_1, ..., n_C and the accompanying word h* are the only
information fed to the learning algorithm. Hence, in the
unsupervised scheme, only the hearer's verbalization
matrix is updated. Of course, since the agents change roles
at each learning episode, the verbalization matrices of
both agents are updated during the learning stage. For
concreteness, let us assume that agent I is the speaker
and so agent J is the hearer in a particular learning
episode. As pointed out before, the idea here is to model
the cross-situational learning scenario (Siskind, 1996) in
which the agents infer the meaning of a given word by
monitoring its occurrence in a variety of contexts.
Accordingly, the learning procedure increases the entries
p^J_{n_1 h*}, ..., p^J_{n_C h*} by the amount 1/M. In addition, for
each object in the context, say n_i, a word, say h, is
chosen randomly and the entry p^J_{n_i h} is decreased by the same
amount 1/M, thus keeping the correct normalization of
the rows of the verbalization matrix. (The possibility
that h = h* is not ruled out.) This procedure, which is
inspired by Moran's model of population genetics (Ewens,
2004), guarantees a minimum disturbance in the
verbalization matrix and can be interpreted as the lateral
inhibition of the competing word-object associations. We
note that during the learning stage the agent playing the
hearer role does not need to guess which object in the
context is named by word h*.

An extra rule is needed to keep the entries within 
the unit interval [0, 1]: we assume that once an entry
reaches the value p_nh = 1 or p_nh = 0 it becomes fixed, so
the extremes of the unit interval act as absorbing barriers 
for the stochastic dynamics of the learning algorithm. 
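
One possible reading of the unsupervised update, including the absorbing barriers, is sketched below. Entries are stored as integer counts k with p = k/M (a representation chosen here). How a move involving a frozen entry is handled is not spelled out in the text; the sketch assumes that such a move is simply skipped, so that the row normalization is never broken. The function name and the toy matrix are hypothetical.

```python
import random


def unsupervised_update(P, context, h_star, M, rng):
    """Cross-situational update of the hearer's count matrix P (probability = count / M)."""
    H = len(P[0])
    for n in context:
        h = rng.randrange(H)             # word to be inhibited; h == h_star is allowed
        if h == h_star:
            continue                     # the +1/M and -1/M moves cancel out
        # Assumption: entries sitting at 0 or M are frozen, so the move is
        # skipped whenever either of the two entries involved is frozen.
        if P[n][h_star] in (0, M) or P[n][h] in (0, M):
            continue
        P[n][h_star] += 1                # reinforce the word that was heard
        P[n][h] -= 1                     # lateral inhibition of a competitor


rng = random.Random(0)                   # arbitrary seed
M = 4
P_J = [[2, 2], [1, 3], [4, 0]]           # toy hearer matrix: 3 objects, 2 words
unsupervised_update(P_J, context=[0, 2], h_star=0, M=M, rng=rng)
```

In this toy call row 1 is outside the context and row 2 is already binary, so only row 0 can change.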



B. Supervised learning 

The setting is identical to that described before
except that now the hearer must guess which object in the
context the speaker named by h* and then
communicate its choice to the speaker (using some nonlinguistic
means, such as pointing to the chosen object). In turn,
the speaker must provide another nonlinguistic hint to
indicate which object in the context it named by word
h*. Let us assume that the speaker associates word h* with
object n_1. If the hearer's guess happens to be the correct
one, then both entries p^I_{n_1 h*} and p^J_{n_1 h*} are incremented
by the amount 1/M. Furthermore, two words, say h_a
and h_b, are chosen randomly and the entries p^I_{n_1 h_a} and
p^J_{n_1 h_b} are decreased by 1/M so the normalization of row
n_1 is preserved in both verbalization matrices. Suppose
now the hearer's guess is wrong, say, object n_2 instead
of n_1. Then both entries p^I_{n_1 h*} and p^J_{n_2 h*} are decreased
by the amount 1/M and, as before, two words h_a and
h_b are chosen randomly and the entries p^I_{n_1 h_a} and p^J_{n_2 h_b}
are increased by 1/M. As in the unsupervised case, the
extremes p_nh = 1 and p_nh = 0 are absorbing barriers.
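
The supervised rule can be sketched along the same lines. The sketch assumes that the speaker I named object n1 with word h_star and that the hearer J guessed object n2; on success the heard word is reinforced in row n1 of both matrices, and on failure it is weakened in row n1 of the speaker and in row n2 of the hearer. As before, entries are integer counts with p = k/M, moves touching a frozen entry are skipped, and all names are hypothetical choices made here.

```python
import random


def shift(P, n, src, dst, M):
    """Move 1/M of probability in row n from word src to word dst.

    Assumption: nothing happens if src == dst or if either entry is frozen (0 or M).
    """
    if src == dst or P[n][src] in (0, M) or P[n][dst] in (0, M):
        return
    P[n][src] -= 1
    P[n][dst] += 1


def supervised_update(PI, PJ, n1, n2, h_star, M, rng):
    """Update speaker matrix PI and hearer matrix PJ after one episode."""
    H = len(PI[0])
    h_a, h_b = rng.randrange(H), rng.randrange(H)   # randomly chosen compensating words
    if n2 == n1:                         # success: reinforce h_star for object n1
        shift(PI, n1, h_a, h_star, M)
        shift(PJ, n1, h_b, h_star, M)
    else:                                # failure: weaken h_star where it was used
        shift(PI, n1, h_star, h_a, M)
        shift(PJ, n2, h_star, h_b, M)


rng = random.Random(0)                   # arbitrary seed
M = 4
PI = [[2, 2], [2, 2]]
PJ = [[2, 2], [2, 2]]
supervised_update(PI, PJ, n1=0, n2=0, h_star=0, M=M, rng=rng)   # correct guess
supervised_update(PI, PJ, n1=0, n2=1, h_star=0, M=M, rng=rng)   # wrong guess
```

Because every move transfers exactly one unit within a row, the normalization of all rows is preserved in both matrices.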

The weak point of this learning scheme is the need for 
nonlinguistic hints to communicate the success or failure 
of the communication episode. This implies that, prior
to learning, the agents are already capable of
communicating (and understanding) sophisticated meanings such as
success and failure and of behaving (by updating their
verbalization matrices) accordingly. In fact, feedback about
the outcome of the communication episode may be seen 
as a form of telepathic meaning transfer. 

III. RESULTS 

Simulation experiments of the two learning algorithms 
described above show, not surprisingly, that after a tran- 
sient the two agents become identical, in the sense that 
they are described by the same verbalization matrix. In 
addition, in the case of unsupervised learning the
stochastic dynamics always leads to binary verbalization
matrices, i.e., matrices whose entries p_nh can take on the
values 1 or 0 only. Of course, once the dynamics produces
a binary matrix it becomes frozen. This same outcome
characterizes the supervised case as well, except in the
cases where the lexicon size H is of the same order as the
context size C. However, as we focus on the regime where
C is finite and N and H are large, we can guarantee that
the stochastic dynamics leads to binary verbalization
matrices regardless of the learning procedure.

Once the dynamics becomes frozen (and so the learn- 
ing stage is over) we measure the average communication 
error e as follows. The speaker chooses object n from the 
list of N objects and emits the corresponding word (there 
is a unique word assigned to any given object, i.e., there is 
a single entry 1 in any row of the verbalization matrix). 
The hearer must then infer which object is named by 
that word. Since the same word can name many objects 
(i.e., there may be many entries 1 in a given column), 
the probability φ_n that the hearer's guess is correct is
simply the reciprocal of the number of objects named by
that word. This probability is the communication
accuracy regarding object n. The procedure is repeated for
the N objects, so the average communication error is
defined as e = 1 − φ, where φ = Σ_n φ_n / N is the average
communication accuracy of the algorithm.
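
For a frozen (binary) lexicon this measure is straightforward to compute. In the sketch below the lexicon is represented by a list `word_of`, where `word_of[n]` is the unique word naming object n; this representation, the function name, and the toy lexicon are assumptions made here for illustration.

```python
from collections import Counter


def communication_error(word_of):
    """Average communication error of a binary lexicon given as word_of[n]."""
    N = len(word_of)
    sharing = Counter(word_of)                        # number of objects named by each word
    accuracy = sum(1.0 / sharing[h] for h in word_of) / N
    return 1.0 - accuracy


# Hypothetical lexicon: objects 0 and 1 share word 0, object 2 alone uses word 3.
e = communication_error([0, 0, 3])                    # accuracy = (1/2 + 1/2 + 1) / 3 = 2/3
```

A one-to-one lexicon gives zero error, whereas a lexicon in which all N objects share a single word gives an error of 1 − 1/N.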

FIG. 1: Communication error e as a function of the ratio a =
H/N between the number of words H and the number of
objects N for N = 16 (▽), 24 (△) and 96 (○). The open (filled)
symbols represent the data for the unsupervised (supervised)
algorithm. The error bars are smaller than the symbol sizes.
The solid line is the result of the extrapolation for N → ∞
(see Fig. 2), whereas the dashed line represents the optimal
performance 1 − a. The parameters are C = 2 and M = 10*.

As already pointed out, the normalization condition
on the rows of the verbalization matrix P allows for the
possibility that a certain number of words are not used
by the lexicon acquisition algorithms. Let H_u ≤ H stand
for the actual number of words used by those algorithms.
Then we can easily convince ourselves that φ = H_u/N
simply by noting that Σ_n φ_n = 1 when the sum is
restricted to objects that are associated with the same word.
Finally, we note that in the definitions of these
communication measures the context plays no role at all; indeed
the context is relevant only during the learning stage.
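
The relation between the average accuracy and the number of used words can be made explicit by grouping the objects according to the word that names them. In the short derivation below, h(n) denotes the word naming object n and k_h the number of objects named by word h; both symbols are introduced here only for this purpose.

```latex
\phi \;=\; \frac{1}{N}\sum_{n=1}^{N}\phi_n
     \;=\; \frac{1}{N}\sum_{h\ \mathrm{used}}\ \sum_{n:\,h(n)=h}\frac{1}{k_h}
     \;=\; \frac{1}{N}\sum_{h\ \mathrm{used}}\frac{k_h}{k_h}
     \;=\; \frac{H_u}{N}
```

For instance, with N = 3 objects of which two share a word, the accuracy is (1/2 + 1/2 + 1)/3 = 2/3, which equals H_u/N with H_u = 2.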

It is important to estimate the optimal (minimum)
communication error e_m in our learning scenario since,
in addition to being a lower bound to the
communication error produced by the learning algorithms, it allows
us to rate their absolute performances. For H < N the
optimal communication error is obtained by making a
one-to-one assignment between H − 1 words and H − 1
objects, and then assigning the single remaining word to
the remaining N − H + 1 objects. This procedure yields
e_m = 1 − H/N = 1 − a. For H > N we can obtain
e_m = 0 simply by discarding H − N words and making
a one-to-one word-object assignment with the other N
words. In fact, using our finding that φ = H_u/N we see
that, as expected, the optimal performance is obtained
by setting H_u = H if H < N and H_u = N if H > N.
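
The optimal assignment can be written down explicitly and checked against the formula for the optimal error. The sketch below uses 0-based labels, represents a lexicon as a list `word_of` (with `word_of[n]` the word naming object n), and evaluates the error as one minus the fraction of used words per object; the function names are hypothetical.

```python
def optimal_lexicon(N, H):
    """word_of[n] for an assignment that uses as many distinct words as possible."""
    if H >= N:
        return list(range(N))                     # one-to-one; the other H - N words are discarded
    # H - 1 objects get their own word; the last word is shared by the remaining N - H + 1.
    return [min(n, H - 1) for n in range(N)]


def error_from_used_words(word_of):
    """Communication error computed as 1 - (number of used words) / N."""
    return 1.0 - len(set(word_of)) / len(word_of)


e_m = error_from_used_words(optimal_lexicon(N=8, H=4))    # 1 - 4/8
e_zero = error_from_used_words(optimal_lexicon(N=4, H=8)) # more words than objects
```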

Figure 1 shows the comparison between the optimal
performance and the actual performances of the two
learning algorithms as a function of the ratio a. In this,
as well as in the other figures of this paper, each symbol
stands for the average over 10^ independent samples or
language games. The performance of the supervised
algorithm deteriorates as the number of objects N increases,
in contrast to that of the unsupervised algorithm which
actually shows a slight improvement in this case. For
N → ∞, both algorithms produce the same
communication error (see Fig. 2), which is shown by the solid line in
Fig. 1. We note that a preliminary comparative analysis
of these algorithms for N = 8 led to an incorrect claim
about the general superiority of the supervised learning
scheme (Fontanari & Perlovsky, 2006). For small
values of a the performances of the two learning algorithms
are practically indistinguishable from the optimal
performance, but as we will argue below the algorithms actually
never achieve that performance, except for a = 0.

FIG. 2: Dependence of the communication error e on the
reciprocal of the number of objects 1/N for a = 0.5 for the
unsupervised (○) and supervised (●) learning algorithms. The
error bars are smaller than the symbol sizes. The linear
fittings (solid straight lines) yield e = 0.5690 ± 0.0003
(unsupervised) and e = 0.5677 ± 0.0004 (supervised) for N → ∞.
The Monte Carlo estimate of the error for the random
assignment of objects to words is given by the symbols × and the
dashed horizontal line corresponds to the estimate of Eq. (3),
e_r = 0.5677. The parameters are C = 2 and M = 3 10*.

It is instructive to calculate the communication error in 
the case that the N objects are assigned randomly to the 
H words. This is a classical occupancy problem discussed 
at length in the celebrated book by Feller (1968). In this 
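occupancy problem, the probability P_m that exactly m
words are not used in the assignment of the N objects
to the H words (i.e., m = H − H_u) is

P_m = \binom{H}{m} \sum_{\nu=0}^{H-m} (-1)^{\nu} \binom{H-m}{\nu} \left(1 - \frac{m+\nu}{H}\right)^{N}    (1)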

which in the limits N → ∞ and H → ∞ reduces to the
Poisson distribution

p(m; \lambda) = \exp(-\lambda)\, \frac{\lambda^{m}}{m!}    (2)

where λ = H exp(−N/H) remains bounded (Feller,
1968). Hence the average communication accuracy
resulting from the random assignment of objects to words
is simply (H − ⟨m⟩)/N, which yields the communication
error

e_r = 1 - a + a \exp(-1/a).    (3)

FIG. 3: Communication error e of the unsupervised
lexicon acquisition algorithm for context size C = 4 and N =
24 (▽), 36 (△), 48 (×), and 96 (○). The error bars are smaller
than the symbol sizes. The learning rate is 1/M = 10~* and
the solid line is the result of Eq. (3).

Surprisingly, Eq. (3) describes perfectly the
communication error of the two learning algorithms in the
limit N → ∞ (solid line in Fig. 1). We note that the
(small) discrepancy observed in Fig. 2 for the
extrapolated data of the unsupervised algorithm and the
analytical prediction can be reduced to zero by decreasing the
learning rate 1/M. Equation (3) explains also why the
performances of the algorithms are practically
indistinguishable from the optimal performance for small a, since
the difference between them vanishes as exp(−1/a). In
addition, Eq. (3) shows that in the limit of large a, the
communication error vanishes as 1/a.
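
The random-assignment prediction is easy to check numerically. The following Monte Carlo sketch assigns each of the N objects to a uniformly chosen word, computes the error as 1 − H_u/N, and compares the sample average with the closed form 1 − a + a exp(−1/a). The values N = 2000, H = 1000, the number of samples, the seed, and the function names are arbitrary choices made here, not the settings used for the figures.

```python
import math
import random


def random_assignment_error(N, H, rng):
    """Error of one random lexicon: each object picks one of the H words uniformly."""
    used = {rng.randrange(H) for _ in range(N)}   # set of words used at least once
    return 1.0 - len(used) / N                    # accuracy equals H_u / N


def occupancy_prediction(a):
    """Large-N prediction for the error at fixed ratio a = H/N."""
    return 1.0 - a + a * math.exp(-1.0 / a)


rng = random.Random(1)                            # arbitrary seed
N, H, samples = 2000, 1000, 100
mc = sum(random_assignment_error(N, H, rng) for _ in range(samples)) / samples
prediction = occupancy_prediction(H / N)          # about 0.5677 for a = 0.5
print(mc, prediction)
```

For finite N the Monte Carlo average differs slightly from the prediction, and the gap is expected to shrink as N grows at fixed a.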

A word is in order about the effect of the context size C
on the performance of the two learning algorithms, since
Figs. 1 and 2 exhibit the results for C = 2 only.
Simulations for larger values of C show that this parameter is
completely irrelevant for the performance of the
supervised algorithm. Of course, this is expected since,
regardless of the context size, at most two rows (object labels)
of the verbalization matrices are updated. But the
situation is far from obvious for the unsupervised algorithm
since C determines the number of rows to be updated in
each round of the game. However, the results
summarized in Fig. 3 for C = 4 indicate that, despite strong
finite-size effects particularly for small a, the
communication error ultimately tends to e_r in the limit of large
N.



IV. CONCLUSION 

In this paper we have unveiled two remarkable results. 
First, the supervised and unsupervised schemes for boot- 
strapping a lexicon yield the same communication accu- 
racy in the limit of very large lexicon sizes. For finite lex- 
icon sizes the supervised scheme always outperforms the 
unsupervised one, but its performance degrades as the 
lexicon size increases, whereas the performance of the 
unsupervised learning algorithm improves slightly with 
increasing lexicon size (see Fig. 1). Second, those
performances tend to the communication accuracy obtained by
a random occupancy problem in which the N objects are 
assigned randomly to the H words. These findings reveal 
a surprising inefficiency of traditional lexicon bootstrap- 
ping scenarios when evaluated in the realistic regime of 
very large lexicon sizes. It would be most interesting 
to devise sensible scenarios that reproduce the optimal 
communication performance or, at least, that exhibit a
communication error that decays faster than the random
occupancy result, 1/a = N/H, in the case the number
of available words is much greater than the number of
objects (H ≫ N).

The scenarios studied here are easily adapted to model 
the problem of lexicon acquisition (rather than boot- 
strapping): we have just to assume that one of the agents, 
named the master in this case, knows the correct lexi- 
con and so its verbalization matrix is kept fixed during 
the entire learning procedure; the verbalization matrix 
of the other agent - the pupil - is allowed to change fol- 
lowing the update algorithms described before (see, e.g., 
Fontanari, Tikhanoff, Cangelosi, Ilin, & Perlovsky, 2009). 
Most interestingly, in this context, statistical word
learning has been observed in controlled experiments
involving infants (Smith & Yu, 2008) and adults (Yu & Smith,
2007). Similar experiments, but now aiming at boot- 
strapping a lexicon, could be easily carried out by re- 
placing our virtual agents by two adults, who would then 
resort to some conscious or unconscious mechanism to 
track the co-occurrence of words and objects. Of course, 
the very emergence of pidgin - a means of communication 
between two or more groups which lack a common lan- 
guage (Thomason & Kaufman, 1988) - can be seen as a 
realization of such an experiment and serves as additional 
justification for the study of lexicon bootstrapping. 